

## Computer Architecture

Hossein Asadi
Department of Computer Engineering
Sharif University of Technology
asadi@sharif.edu



## Today's Topics

- Pipelining
- Pipelining Hazards



## Copyright Notice

- Parts (text & figures) of this lecture are adapted from:
  - Computer Organization & Design, The Hardware/Software Interface, 3<sup>rd</sup> Edition, by D. Patterson and J. Hennessy, Morgan Kaufmann Publishers, 2005.
  - "Intro to Computer Architecture" handouts, by Prof. Hoe, CMU, Spring 2009.
  - "Computer Architecture & Engineering" handouts, by Prof. Kubiatowicz, UC Berkeley, Spring 2004.
  - "Intro to Computer Architecture" handouts, by Prof. Hoe, UWisc, Spring 2019.
  - "Computer Arch I" handouts, by Prof. Garzarán, UIUC, Spring 2009.
  - "Intro to Computer Organization" handouts, by Prof. Mahlke & Prof. Narayanasamy, Winter 2008.



## Instruction Latencies and Throughput

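#### Single-Cycle CPU

Load: IF, ID/RR, Exec, Mem, Wr (all within one long clock cycle)

#### Multiple-Cycle CPU

| | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 |
|---|---|---|---|---|---|
| Load | IF | ID/RR | Exec | Mem | Wr |

#### Pipelined CPU

| | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 |
|---|---|---|---|---|---|---|---|---|
| Load | Ifetch | ID/RR | Exec | Mem | Wr | | | |
| Load | | Ifetch | ID/RR | Exec | Mem | Wr | | |
| Load | | | Ifetch | ID/RR | Exec | Mem | Wr | |
| Load | | | | Ifetch | ID/RR | Exec | Mem | Wr |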
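The cycle counts in the three timing diagrams can be reproduced with a small sketch. It assumes independent instructions (no hazards) and a 5-stage datapath; the function name `total_cycles` is illustrative and not part of the lecture.

```python
def total_cycles(num_instructions: int, num_stages: int = 5) -> dict:
    """Clock cycles needed to finish a run of independent instructions."""
    return {
        # One (long) clock cycle per instruction.
        "single_cycle": num_instructions,
        # Each instruction occupies the datapath for all of its stages.
        "multi_cycle": num_instructions * num_stages,
        # The first instruction fills the pipeline; every later
        # instruction completes one cycle after its predecessor.
        "pipelined": num_stages + (num_instructions - 1),
    }


counts = total_cycles(4)  # the four loads of the pipelined diagram
print(counts)
```

For four loads this gives 8 cycles for the pipelined CPU, matching Cycle 1 through Cycle 8 in the diagram, versus 20 for the multiple-cycle CPU. The single-cycle count looks smallest, but each of its cycles must be long enough for all five stages.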



## Introduction of Pipeline

Machine assembly line

Datapath of pipeline







Execution in a Pipelined Datapath (figure: clock cycles CC1 to CC8 on the horizontal axis; each instruction passes through IM, Reg, DM, and Reg in successive stages, labeled IF through MEM and WB)

Sharif University of Technology, Spring 2019, Lecture 8



## Mixed Instructions in Pipeline





## Pipeline Principles

- Principles
  - All instructions that share a pipeline must have same stages in same order
    - add does nothing during Mem stage
    - sw does nothing during WB stage





## Pipeline Principles (cont.)

- Principles
  - All intermediate values must be latched each cycle
  - No functional block reuse for an instruction
    - E.g. need 2 adders and ALU (like in single-cycle)





## Pipeline Principles (cont.)

- Question:
  - What if we wanted to bypass a stage for an instruction?











Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB, shown during the Memory Access stage while executing the sequence lw \$12, 1000(\$4); add \$10, \$1, \$2; sub \$15, \$4, \$1.




## Pipelining Performance

- Execution time: ET = IC \* CPI \* CT (instruction count \* cycles per instruction \* clock cycle time)
- Pipelining achieves high throughput
  - Without reducing instruction *latency*
- Example:
  - A CPU that takes 5 ns to execute an instruction is pipelined into 5 equal stages
    - The latch between each pair of stages has a delay of 0.25 ns
  - Question 1: What is the minimum clock cycle time of this architecture?
  - Question 2: What is the maximum speedup this architecture can achieve (compared to the single-cycle architecture)?
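One way to work the example is sketched below. It assumes an ideal pipeline (CPI of 1, no hazards) and a single-cycle baseline whose clock period equals the 5 ns instruction time with no latch overhead; the function name `pipeline_timing` is illustrative.

```python
def pipeline_timing(total_logic_ns: float, num_stages: int, latch_ns: float):
    """Return (cycle time, max speedup, per-instruction latency) in ns."""
    stage_ns = total_logic_ns / num_stages   # logic delay of one stage
    cycle_ns = stage_ns + latch_ns           # each stage also pays the latch delay
    speedup = total_logic_ns / cycle_ns      # ideal throughput gain over single cycle
    latency_ns = num_stages * cycle_ns       # time for one instruction to pass through
    return cycle_ns, speedup, latency_ns


cycle_ns, speedup, latency_ns = pipeline_timing(5.0, 5, 0.25)
print(f"min cycle time = {cycle_ns} ns")       # 1.25 ns
print(f"max speedup    = {speedup}x")          # 4.0x
print(f"latency        = {latency_ns} ns")     # 6.25 ns
```

Under these assumptions the speedup is 4 rather than 5 because of the latch delay, and the latency of a single instruction grows from 5 ns to 6.25 ns: throughput improves, latency does not.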



## Issues With Pipelining

- Pipelining Creates Potential Hazards
  - What happens if two instructions need the same hardware structure at the same time?
    - Structural hazard
  - What happens when an instruction needs the result of another instruction?
    - Data hazard
  - What happens on a branch?
    - Control hazard



## Structural Hazards

- How?
  - Two instructions require use of a given hardware resource at the same time
- Access to Memory
  - Solution: separate instruction and data caches
- Access to Register File
  - Solution: multi-ported register file
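The memory case can be illustrated with a small sketch. It assumes a 5-stage pipeline with no stalls, so the instruction at position `i` is fetched in cycle `i + 1` and reaches its memory stage in cycle `i + 4`; the function name `memory_conflicts` and the memory labels are hypothetical.

```python
def memory_conflicts(program, unified_memory=True):
    """Return the cycles in which two instructions need the same memory."""
    uses = {}  # (cycle, memory name) -> indices of instructions using it
    imem = "mem" if unified_memory else "imem"
    dmem = "mem" if unified_memory else "dmem"
    for i, op in enumerate(program):
        fetch_cycle = i + 1   # IF stage
        mem_cycle = i + 4     # MEM stage (IF, ID, EX come first)
        uses.setdefault((fetch_cycle, imem), []).append(i)
        if op in ("lw", "sw"):  # only loads and stores access data memory
            uses.setdefault((mem_cycle, dmem), []).append(i)
    return sorted(cycle for (cycle, _), users in uses.items() if len(users) > 1)


program = ["lw", "add", "sub", "or", "and"]
print(memory_conflicts(program, unified_memory=True))   # [4]
print(memory_conflicts(program, unified_memory=False))  # []
```

With a single shared memory, the load's data access in cycle 4 collides with the fetch of the fourth instruction; with separate instruction and data memories the two accesses go to different structures and the conflict disappears.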







Data Hazard: result needed before it is available


## Handling Data Hazards

- SW Solutions
  - Insert independent instructions between the dependent instructions
  - Insert no-op (nop) instructions
- HW Solutions
  - Insert bubbles (i.e., stall the pipeline)
  - Data forwarding
  - Latch-based GPR (general-purpose register) file
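The hardware options can be compared by counting the bubbles each one needs for the same code. The sketch below assumes the classic 5-stage pipeline (IF, ID, EX, MEM, WB) with registers read in ID and in-order issue; the policy names, the `MIN_GAP` table, and the sample program are illustrative assumptions, not taken from the slides.

```python
# Minimum distance, in cycles, between the ID stage of a producer and the
# ID stage of a consumer of its result (assumed 5-stage pipeline).
MIN_GAP = {
    "stall_only": lambda op: 4,      # consumer reads only after WB has finished
    "transparent_rf": lambda op: 3,  # write in first half of WB, read in second half
    "forwarding": lambda op: 2 if op == "lw" else 1,  # a load-use pair still stalls once
}


def count_bubbles(program, policy):
    """program: list of (opcode, dest register or None, list of source registers)."""
    gap = MIN_GAP[policy]
    id_cycle = []      # cycle in which each instruction completes ID
    last_writer = {}   # register -> index of its most recent writer
    for i, (op, dest, srcs) in enumerate(program):
        earliest = id_cycle[-1] + 1 if id_cycle else 2  # IF of first instr is cycle 1
        for reg in srcs:
            if reg in last_writer:
                p = last_writer[reg]
                earliest = max(earliest, id_cycle[p] + gap(program[p][0]))
        id_cycle.append(earliest)
        if dest is not None:
            last_writer[dest] = i
    ideal_last = 2 + len(program) - 1   # ID cycle of the last instruction without stalls
    return id_cycle[-1] - ideal_last


program = [
    ("add", "$1", ["$2", "$3"]),
    ("or",  "$4", ["$1", "$5"]),
    ("lw",  "$6", ["$4"]),
    ("sub", "$7", ["$6", "$1"]),
]
for policy in MIN_GAP:
    print(policy, count_bubbles(program, policy))
```

For this chain of dependent instructions the sketch reports 9 bubbles with stalling alone, 6 with a transparent register file, and 1 with forwarding (the load-use pair). The software solutions correspond to filling those same slots with nops or independent instructions instead of hardware bubbles.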



## Dealing With Data Hazards

- Using a Transparent Register File
  - Use latches rather than flip-flops in the register file (RF)
    - First half-cycle: the register is written
    - Second half-cycle: the new value is read into the pipeline state



## Handling Data Hazards in Hardware: Stall Pipeline







## Reducing Data Hazards Through Forwarding



Avoid stalling by forwarding ALU output from "add" to ALU input for "or"
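The selection logic behind forwarding can be sketched as follows, using the conditions given for the forwarding unit in Patterson and Hennessy. The dictionary fields `reg_write` and `rd` stand for the RegWrite control bit and destination register number held in a pipeline register; these names, the return labels, and the register numbers in the example are illustrative assumptions.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Choose where one ALU operand should come from.

    ex_mem / mem_wb: dicts with 'reg_write' (bool) and 'rd' (register number)
    describing the instructions currently in the MEM and WB stages.
    """
    # Most recent result wins: check the instruction one stage ahead first.
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src_reg:
        return "EX/MEM"
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == src_reg:
        return "MEM/WB"
    return "REGFILE"  # no hazard: use the value read in ID


# "add $1, $2, $3" is in MEM while "or $4, $1, $5" is in EX.
ex_mem = {"reg_write": True, "rd": 1}
mem_wb = {"reg_write": False, "rd": 0}
print(forward_select(1, ex_mem, mem_wb))  # EX/MEM
print(forward_select(5, ex_mem, mem_wb))  # REGFILE
```

Register 0 is excluded because it always reads as zero in MIPS, so a "result" destined for it must never be forwarded.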





## Control or Branch Hazard



## Stalling for Branch Hazards





## Reducing Branch Delay

The branch stall is easily reduced to 2 cycles.





## Stalling for Branch Hazards





## One-Cycle Branch Misprediction Penalty



Target computation & equality check in ID phase
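The effect of the branch penalty on performance can be estimated with a short sketch. It assumes a base CPI of 1, that every branch pays the full penalty (no prediction), and a hypothetical branch frequency of 25%; none of these numbers come from the slides.

```python
def cpi_with_branch_stalls(branch_fraction, penalty_cycles, base_cpi=1.0):
    """Average CPI when each branch adds `penalty_cycles` stall cycles."""
    return base_cpi + branch_fraction * penalty_cycles


BRANCH_FRACTION = 0.25  # assumed share of branches in the instruction mix
for penalty in (2, 1):
    cpi = cpi_with_branch_stalls(BRANCH_FRACTION, penalty)
    print(f"{penalty}-cycle penalty -> CPI = {cpi}")
```

Under these assumptions, resolving the branch in ID (1-cycle penalty) lowers the CPI from 1.5 to 1.25 relative to a 2-cycle penalty.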

## Stalling for Branch Hazards





#### Practice

- Which program has the higher instruction count (IC)?
- Which one has fewer bubbles?
- Which one runs faster?
- How many clock cycles does each take?

## Eliminating Branch Stall

- SPARC and MIPS
  - Use a single branch delay slot to eliminate single-cycle stalls after branches
  - Instruction after a conditional branch is *always* executed in those machines
    - Regardless of whether branch is taken or not!



## Branch Delay Slot





Branch delay slot instruction (next instruction after a branch) is executed even if the branch is taken.


## Filling Branch Delay Slot

```
add $5, $3, $7
sub $6, $1, $4
and $7, $8, $2
beq $6, $7, there
nop /* branch delay slot */
add $9, $1, $2
sub $2, $9, $5
there:
mult $2, $10, $11
```
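Whether an instruction from before the branch may replace the `nop` depends on register dependences. The sketch below is a conservative check under stated assumptions: it looks only at candidates that precede the branch, considers register dependences only (no memory or exception effects), and uses hand-written destination and source lists. The function name `can_fill_delay_slot` is hypothetical.

```python
def can_fill_delay_slot(block, candidate_idx):
    """block: list of (text, dest, srcs); the last entry is the branch.

    Returns True if block[candidate_idx] can be moved to just after the
    branch without changing any value that is read or written in between.
    """
    _, c_dest, c_srcs = block[candidate_idx]
    for _, dest, srcs in block[candidate_idx + 1:]:
        if c_dest is not None and c_dest in srcs:
            return False  # RAW: a later instruction needs the candidate's result
        if dest is not None and dest in c_srcs:
            return False  # WAR: a later instruction overwrites one of its inputs
        if dest is not None and dest == c_dest:
            return False  # WAW: the final register value would change
    return True


block = [
    ("add $5, $3, $7", "$5", ["$3", "$7"]),
    ("sub $6, $1, $4", "$6", ["$1", "$4"]),
    ("and $7, $8, $2", "$7", ["$8", "$2"]),
    ("beq $6, $7, there", None, ["$6", "$7"]),
]
for i, (text, _, _) in enumerate(block[:-1]):
    print(f"{text:18s} movable: {can_fill_delay_slot(block, i)}")
```

Under this check none of the three preceding instructions qualifies: `sub` and `and` produce the registers that `beq` compares, and `add` reads `$7`, which `and` overwrites before the delay slot is reached. Filling the slot from the branch target or the fall-through path is a separate technique that this sketch does not cover.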

## Backup



